SF3B1 iCLIP analysis

Binding site definition

Author
Affiliation
Dr. Mirko Brueggemann

Buchman Institute for Molecular Life Sciences

Published

September 18, 2023

1 Analysis Description

This report holds all analysis and plots for xxx

This report shows downstream processing of computed binding sites for the imb_koenig_2016_13_03 dataset. All replicates (3 wt and 2 mut) were down-sampled to the the counts in the lowest mutant replicate and merged for peak calling with PureCLIP. Processed samples are:

  • imb_koenig_2016_13_03_wt_2
  • imb_koenig_2016_13_03_wt_3
  • imb_koenig_2016_13_03_wt_4
  • imb_koenig_2016_13_03_mut_1
  • imb_koenig_2016_13_03_mut_3 (downsample seed)

1.1 Load libraries

Show code
library(rtracklayer)
library(GenomicRanges)
library(ggplot2)
library(AnnotationDbi)
library(dplyr)
library(reshape2)
library(UpSetR)
library(GenomicFeatures)
library(kableExtra)
library(knitr)
library(ggrepel)
library(gridExtra)
library(grid)
library(viridis)
library(BindingSiteFinder)
library(ComplexHeatmap)
library(forcats)
library(ggtext)
library(patchwork)
library(tibble)
library(tidyr)
library(dplyr)
library(ggpointdensity)
library(ggsci)
library(ggsci)
library(ggtext)
library(waffle)
library(ggrepel)
library(patchwork)
Show code
source("../styles.R")
source("../helper.R")

2 Library downsampling

Due to sequencing depth and library complexity variations, we decided to initially down-sample all replicates to the sample with the overall lowest number of reads (MUT 2). All downstream analysis are performed on the down-sampled data, removing the need of any additional library size normalization.

Show code
df = data.frame(
  sample = c("mut1", "mut2", "wt1", "wt2", "wt3"),
  before = c(19005395, 13892358, 28711869, 20679517, 30222832),
  after =  c(13874238, 13892358, 13780540, 13857030, 13605493)
) %>% pivot_longer(-sample)

ggplot(df, aes(x = sample, y = value, fill = name)) +
  geom_col(position = "dodge") +
  theme_nice() +
  coord_flip() +
  scale_fill_npg() +
  theme(legend.position = "top") +
  labs(
    title = "iCLIP downsampling",
    x = "Sample",
    y = "Number of reads",
    fill = "Downsampling"
  ) + 
    geom_text(aes(label = myFormat(value)), position = position_dodge(width = 0.8), angle = 45, size = 2)

Downsampling of iCLIP samples to the samples with the lowest number of crosslinks.

3 Gene Annotation

Annotations are downloaded from GENCODE v36 (Release 36 (GRCh38)). Annotations are then filtered by feature and transcript annotation level.

  • Feature annotation

    • keep level 1 and 2
    • remove level 3
  • Gencode definition

    • 1 = verified loci, 2 = manually annotated loci,3 = automatically annotated loci
  • Transcript level annotation

    • keep level 1,2 and 3
    • remove level 4,5 and NA
  • Gencode definition

    • 1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA),
    • 2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs),
    • 3 (the only support is from a single EST),
    • 4 (the best supporting EST is flagged as suspect),
    • 5 (no single transcript supports the model structure),
    • NA (the transcript was not analyzed)
Show code
load("/Users/mirko/Projects/Annotations/human/gencode_36/filtered/gencode_v36_filtered.rda")
anno.db = loadDb("/Users/mirko/Projects/Annotations/human/gencode_36/filtered/gencode_v36_filtered.sqlite")
gns = genes(anno.db)
idx = match(gns$gene_id, anno$gene_id)
elementMetadata(gns) = cbind(elementMetadata(gns), elementMetadata(anno)[idx,])

4 Preprocess pureCLIP output

Initially peaks are called using PureCLIP as described in the methods section. Here we pre-filter all PureCLIP called crosslink sites to keep only the most informative proportion.

Show code
peaksInitial = "/Users/mirko/Projects/sf3b1/01_data_subsamp/combined/pureCLIP/PureCLIP.crosslink_sites_mod.bed"
peaksInitial = import(con = peaksInitial, format = "BED")

6 Replicate reproducibility

Since the initially used crosslink sites are computed from the merged signal of all replicates, binding sites resulting from the previous merge might not be reproducible among all replicates. For that reason, we specifically check which of the computed binding sites are reproduced by the individual replicates.

All crosslink events within each binding site are summed up per replicate. The individual threshold for each replicate is set to the 5% quantile of the crosslink distribution. To account for low cross-link replicates, a lower boundary of a minimum of 2 crosslink events per binding site is enforced. For each binding site all replicate from each condition must meet the defined threshold.

Show code
bdsMerge = reproducibilityFilter(bdsMerge, cutoff = c(0.05, 0.05), minCrosslinks = 2, nReps = c(2,3))
Show code
reproducibilityFilterPlot(bdsMerge)

Distribution of summed up crosslinks for each replicate. The count threshold for each repliacte is indicated by a grey line (5% quantile).

Show code
reproducibilitySamplesPlot(bdsMerge)

Overview of binding sites that are shared between replicates. A binding site is reproducible if supported by 2 replicates from the MUT or 3 replicates from the WT condition.

Show code
reproducibilityScatterPlot(bdsMerge)

Pairwise crosslink correlation among replicates of both conditions after reproducibility filtering.

7 Genomic target identification

A major question the follow from binding site definition is the assessment of the genomic targets that SF3B1 binds to. In the following section we first assign computed binding sites to target genes and then place those that match protein coding genes on annotated transcript regions.

7.1 Target gene identification

Show code
selectTerms = c("protein_coding", "tRNA", "lincRNA", "snRNA")
rule = unique(gns$gene_type)
rule = rule[!rule %in% selectTerms]
rule = c(selectTerms, rule)
Show code
peaksReproducible = getRanges(bdsMerge)
targets = subsetByOverlaps(gns, peaksReproducible)
df = findOverlaps(targets, peaksReproducible) %>% as.data.frame()

# split into easy and complex cases
idxDouble = df[duplicated(df$subjectHits),]
idxSingle = df[!duplicated(df$subjectHits),]

# handle single overlap cases
peaksRepoSingle = peaksReproducible[idxSingle$subjectHits]
mcols(peaksRepoSingle)$geneType = targets$gene_type[idxSingle$queryHits]
mcols(peaksRepoSingle)$geneName = targets$gene_name[idxSingle$queryHits]
mcols(peaksRepoSingle)$geneID = targets$gene_id[idxSingle$queryHits]

# handle multi overlap cases
peaksRepoDouble = peaksReproducible[idxDouble$subjectHits]

peaksRepoDoubleCleaned = as(lapply(seq_along(peaksRepoDouble), function(x){
  currPeak = peaksRepoDouble[x]
  currTargets = subsetByOverlaps(targets, currPeak)
  nOverlaps = length(currTargets)
  # 1) take gene type as first criterion
  # -> prefer the type that is first in the `rule` list
  solution = unique(match(currTargets$gene_type, rule))
  nSolutions = length(solution)
  
  if (nSolutions == nOverlaps) {
    # solution successfull
    mcols(currPeak)$geneType = currTargets$gene_type[min(solution)]
    mcols(currPeak)$geneName = currTargets$gene_name[min(solution)]
    mcols(currPeak)$geneID = currTargets$gene_id[min(solution)]
  }
  if (nSolutions < nOverlaps) {
    # no solution found 
    # -> Stop and return NA
    mcols(currPeak)$geneType = NA
    mcols(currPeak)$geneName = NA
    mcols(currPeak)$geneID = NA
  }
  return(currPeak)
}),"GRangesList")
peaksRepoDoubleCleaned = unlist(peaksRepoDoubleCleaned)
peaksRepoDoubleCleaned = peaksRepoDoubleCleaned[!is.na(peaksRepoDoubleCleaned$geneID)]

# assign peaks
bsGene = c(peaksRepoSingle, peaksRepoDoubleCleaned)
bsGene = sortSeqlevels(bsGene)
bsGene = sort(bsGene)
bsGene = unique(bsGene)

# assign targets
targetsGene = targets[targets$gene_id %in% bsGene$geneID]

Here we match binding sites with their hosting genes. Due to the degree of overlapping gene loci in the annotation some binding sites can not be unabigously mapped to a hosting gene. To recover these cases we implement a strategy that first looks for the most frequent gene annotations among overlapping cases and if these yield a tie assignment is followed by the hierarchical order: protein_coding, tRNA, lincRNA, snRNA, transcribed_unitary_pseudogene, transcribed_unprocessed_pseudogene, lncRNA, polymorphic_pseudogene, transcribed_processed_pseudogene, IG_C_gene, unprocessed_pseudogene, unitary_pseudogene, TEC, processed_pseudogene, translated_processed_pseudogene.

Show code
# # count overlaps per peak and gene
df = findOverlaps(targets, peaksReproducible) %>% as.data.frame()
df$geneType = targets$gene_type[df$queryHits]
df = df %>% 
  group_by(subjectHits) %>% 
  summarise(olType = paste0(length(geneType), " annotations overlapping")) %>%
  dplyr::select(olType) %>%
  as.data.frame()

df = basicVectorToNiceDf(df)
# make plot
ggplot(df, aes(x = Type, y = Freq, fill = Type, label = labelNice)) +
    geom_col(color = "black") +
    geom_text(data = df, aes(x = Type, y = 0), size = 4, color = "grey", hjust = -.3) +
    scale_y_log10() +
    coord_flip(clip = "on", expand = TRUE) +
    theme_pub() +
    scale_fill_viridis(discrete = TRUE, direction = -1, option = "B") +
    theme(legend.position = "none") +
    labs(
        title = "Binding sites overlapping multiple annotations",
        x = "",
        y = "Count")

Pairwise crosslink correlation among replicates of both conditions after reproducibility filtering.

Show code
# NOTE to this plot:
# -> shows how much the rule/ hierarchy affects the BS to gene assignment
# ---> many overlaps >1 means many BS will be assigned to their host gene by our rule
# -> overlaps can result in genes with different types, or genes with multiple types
Show code
df1 = data.frame(GeneType = (bsGene$geneType), type = "Peak") %>%
    mutate(GeneType = ifelse(grepl("pseudogene", GeneType), "pseudogene", GeneType)) %>%
    table() %>%
    as.data.frame() %>%
    mutate(label = paste0(format(Freq, big.mark = ",", decimal.mark = "."),
                          " (", format(round((Freq / sum(Freq))*100, digits = 2),
                                       big.mark = ",", decimal.mark = "."),")")) %>%
    mutate(GeneType = factor(GeneType, levels = c(GeneType[order(Freq)])))

ggplot(df1, aes(x = GeneType, y = Freq, fill = GeneType, label = label)) +
    geom_col(color = "black") +
    geom_text(aes(y = 0 ), size = 4, color = "grey", hjust = -.3) +
    scale_y_log10() +
    coord_flip(clip = "on", expand = TRUE) +
    scale_fill_npg() +
    theme_pub() +
    theme(legend.position = "none") +
    labs(title = "Binding spectrum - peaks",
         y = "Count",
         x = "Gene type") 

Overlap resolved binding spectrum for binding sites summarized in the top 3 most frequent gene types.

Show code
df2 = data.frame(GeneType = (targetsGene$gene_type), type = "Targets") %>%
    mutate(GeneType = ifelse(grepl("pseudogene", GeneType), "pseudogene", GeneType)) %>%
    table() %>%
    as.data.frame() %>%
    mutate(label = paste0(format(Freq, big.mark = ",", decimal.mark = "."),
                          " (", format(round((Freq / sum(Freq))*100, digits = 2),
                                       big.mark = ",", decimal.mark = "."),")")) %>%
    mutate(GeneType = factor(GeneType, levels = c(GeneType[order(Freq)])))

ggplot(df2, aes(x = GeneType, y = Freq, fill = GeneType, label = label)) +
  geom_col(color = "black") +
  geom_text(aes(y = 0 ), size = 4, color = "grey", hjust = -.3) +
  scale_y_log10() +
  coord_flip(clip = "on", expand = TRUE) +
  scale_fill_npg() +
  theme_pub() +
  theme(legend.position = "none") +
  labs(title = "Binding spectrum - targets",
       y = "Count",
       x = "Gene type") 

Overlap resolved binding spectrum for target genes summarized in the top 3 most frequent gene types.

7.2 Transcript region identification

Show code
rule = c("intron", "cds", "utr3", "utr5")

To identify hosting transcript regions for each binding sites we overlap binding sites of protein-coding genes with the respective transcript regions (Introns, CDS, UTRs). Overlaps within transcripts are resolved by applying a majority vote system and ties are further resolved in a hierarchical manor with a fall back rule intron, cds, utr3, utr5 in the case of ties.

Show code
targetsProt = targetsGene[targetsGene$gene_type == "protein_coding"]
bsProt = bsGene[bsGene$geneType == "protein_coding"]

# export(bsProt, "./data/bsProt.bed", format = "BED")

### count the overlap of each binidng site within each part of the gene
cdseq = cds(anno.db) %>% countOverlaps(bsProt,.)
intrns = unlist(intronsByTranscript(anno.db)) %>% countOverlaps(bsProt,.)
utrs3 = unlist(threeUTRsByTranscript(anno.db)) %>% countOverlaps(bsProt,.)
utrs5 = unlist(fiveUTRsByTranscript(anno.db)) %>% countOverlaps(bsProt,.)
count.df = data.frame(cds = cdseq, intron = intrns, utr3 = utrs3, utr5 = utrs5)

### applying the majority vote
count.df = count.df[, rule] %>% 
  as.matrix %>% 
  cbind.data.frame(., outside = ifelse(rowSums(count.df) == 0, 1, 0) )
names = colnames(count.df)
reg = apply(count.df, 1, function(x){ names[which.max(x)] })
### add region annotation to binding sites object
mcols(bsProt)$region = reg
Show code
plotCountDf = count.df
plotCountDf[plotCountDf > 1] = 1
m = make_comb_mat(plotCountDf)
ha = HeatmapAnnotation(
  "Intersections" = anno_barplot(comb_size(m), border = FALSE, gp = gpar(fill = "#595959"), height = unit(6, "cm"))
)

ht = UpSet(m,
           comb_order = order(comb_size(m), decreasing = T),
            top_annotation = ha,
           comb_col = "cornflowerblue", bg_col = "white", pt_size = unit(.5, "cm") ,
           border = T, lwd = 2, bg_pt_col = "#333333"
)

ss = set_size(m)
cs = comb_size(m)
ht = draw(ht,  padding = unit(c(0, 0, 10, 0), "mm"))
od = column_order(ht)
decorate_annotation("Intersections", {
    grid.text(format(cs[od], big.mark = ",", decimal.mark = "."), x = seq_along(cs), y = unit(cs[od], "native") + unit(.1, "pt"), 
        default.units = "native", just = c("left", "bottom"), 
        gp = gpar(fontsize = 8, col = "black"), rot = 45)
})

Overlap of binding sites with different transcript regions. Conflicting transcript annotations are resolved by transcript reigon.

Show code
#
### remove binding sites outside of annotated regions
bsTranscript = bsProt[bsProt$region != "outside"]
targetsTranscript = targetsProt[targetsProt$gene_id %in% bsTranscript$geneID]

### make nice pie chart
df = data.frame(Type = names(table(bsTranscript$region)), Freq = as.vector(table(bsTranscript$region)))
df = df[order(df$Freq, decreasing = F),]
df$Type = factor(df$Type, levels = df$Type)
df$Frac = df$Freq / sum(df$Freq)
df$ymax = cumsum(df$Frac)
df$ymin = c(0, head(df$ymax, n=-1))
df$labPos = (df$ymax + df$ymin) / 2
df$NFrac = round(df$Frac * 100)
df$NFrac2 = round(df$Freq / sum(df$Freq), digits = 4)
df$NFracNice = df$NFrac2 * 100
df$labelNice = paste0(format(df$Freq, big.mark = ",", decimal.mark = "."), " (", df$NFracNice, "%)")
df$labelNice2 = paste0(df$Type, ": ", format(df$Freq, big.mark = ",", decimal.mark = "."), " (", df$NFracNice, "%)")

ggplot(df, aes(x = Type, y = Freq, fill = Type, label = labelNice)) +
    geom_col() +
    geom_text(data = df, aes(x = Type, y = 0), size = 6, color = "lightgrey", hjust = -.3) +
    scale_y_log10() +
    coord_flip(clip = "on", expand = TRUE) +
    scale_fill_npg() +
    theme(legend.position = "none") +
    labs(title = "Binding spectrum",
         subtitle = "Bar-chart (absolute values)",
         y = "Number of binding sites",
         x = NULL) +
    theme_pub() +
    theme(aspect.ratio = 1, legend.position = "none") 

SF3B1 transcript binding spectrum. Percentage binding sites count per region.

7.3 Region size normalization

When assessing the number of binding sites per transcript region, the length of the hosting region has a strong effect on the raw number of counted binding sites. Here we use the mean region length to normalize for this effect.

Show code
cdsLen = cds(anno.db) %>% 
  subsetByOverlaps(., bsTranscript) %>% width %>% sum
intrnsLen = unlist(intronsByTranscript(anno.db)) %>% 
  subsetByOverlaps(., bsTranscript) %>% width %>% sum
utrs3Len = unlist(threeUTRsByTranscript(anno.db)) %>% 
  subsetByOverlaps(., bsTranscript) %>% width %>% sum
utrs5Len = unlist(fiveUTRsByTranscript(anno.db)) %>% 
  subsetByOverlaps(., bsTranscript) %>% width %>% sum
lenDfSum = data.frame(lenSum = c(cdsLen, intrnsLen, utrs3Len, utrs5Len))


df = data.frame(type = names(table(bsTranscript$region)), val = as.vector(table(bsTranscript$region)))
df = cbind(df, lenDfSum)
df = df[order(df$val, decreasing = F),]
df$type = factor(df$type, levels = df$type)

ggplot(df, aes(x = type, y = val/lenSum, fill = type)) +
  geom_col(color = "black") + 
  scale_fill_npg() +
  coord_flip() +
  theme_pub() +
  xlab("Type") +
  ylab("Scaled count [mean(length)]") +
  theme(legend.position = "none") 

Mean length normalization of binding sites per genomic region

8 Tables and numbers

Show code
### ============================================================================
### Numbers
### ----------------------------------------------------------------------------
###

peaksInitial = peaksInitial[!seqnames(peaksInitial) %in% "chrY"]
b1 = setRanges(bds, peaksInitial)
c1 = coverageOverRanges(b1, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

peaksFiltered = peaksFiltered[!seqnames(peaksFiltered) %in% "chrY"]
b2 = setRanges(bds, peaksFiltered)
c2 = coverageOverRanges(b2, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

b25 = setRanges(bds, peaksFilteredPerGene)
c25 = coverageOverRanges(b25, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

b3 = setRanges(bds, peaksProcessed)
c3 = coverageOverRanges(b3, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

b4 = setRanges(bds, peaksReproducible)
c4 = coverageOverRanges(b4, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

b5 = setRanges(bds, bsGene)
c5 = coverageOverRanges(b5, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

b6 = setRanges(bds, bsProt)
c6 = coverageOverRanges(b6, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

b7 = setRanges(bds, bsTranscript)
c7 = coverageOverRanges(b7, returnOptions = "merge_positions_keep_replicates") %>% mcols() %>% as.matrix()

df = data.frame(nPeaks = c(length(peaksInitial),
                           length(peaksFiltered),
                           length(peaksFilteredPerGene),
                           length(peaksProcessed), 
                           length(peaksReproducible), 
                           length(bsGene),
                           length(bsProt), 
                           length(bsTranscript)),
                nTargets = c(length(subsetByOverlaps(gns, peaksInitial)),
                             length(subsetByOverlaps(gns, peaksFiltered)),
                             length(subsetByOverlaps(gns, peaksFilteredPerGene)),
                             length(subsetByOverlaps(gns, peaksProcessed)),
                             length(subsetByOverlaps(gns, peaksReproducible)),
                             length(unique(bsGene$geneID)),
                             length(unique(bsProt$geneID)),
                             length(unique(bsTranscript$geneID)) ),
                nXlinksMUT = c(sum(dplyr::select(as.data.frame(c1), ends_with("MUT"))),
                               sum(dplyr::select(as.data.frame(c2), ends_with("MUT"))),
                               sum(dplyr::select(as.data.frame(c25), ends_with("MUT"))),
                               sum(dplyr::select(as.data.frame(c3), ends_with("MUT"))),
                               sum(dplyr::select(as.data.frame(c4), ends_with("MUT"))),
                               sum(dplyr::select(as.data.frame(c5), ends_with("MUT"))),
                               sum(dplyr::select(as.data.frame(c6), ends_with("MUT"))),
                               sum(dplyr::select(as.data.frame(c7), ends_with("MUT")))
                ),
                nXlinksWT = c(sum(dplyr::select(as.data.frame(c1), ends_with("WT"))),
                              sum(dplyr::select(as.data.frame(c2), ends_with("WT"))),
                              sum(dplyr::select(as.data.frame(c25), ends_with("WT"))),
                              sum(dplyr::select(as.data.frame(c3), ends_with("WT"))),
                              sum(dplyr::select(as.data.frame(c4), ends_with("WT"))),
                              sum(dplyr::select(as.data.frame(c5), ends_with("WT"))),
                              sum(dplyr::select(as.data.frame(c6), ends_with("WT"))),
                              sum(dplyr::select(as.data.frame(c7), ends_with("WT")))
                )
)

df = format(df, big.mark = ",", decimal.mark = ".")
rownames(df) = c("CLS - PureCLIP", "CLS - Global filter", "CLS - Gene level filter", "BS - Merged", "BS - Reproducible", "BS - Gene", "BS - Protein", "BS - Transcript")
colnames(df) = c("CLS/BS (N)", "Targets (N)", "Xlinks (MUT)", "Xlinks (WT)")

kable(df, caption = "Processing Overview. CLS = Crosslink sites/ BS = Binding sites") %>% 
  kable_styling("striped") %>%
  scroll_box(width = "100%")
Processing Overview. CLS = Crosslink sites/ BS = Binding sites
CLS/BS (N) Targets (N) Xlinks (MUT) Xlinks (WT)
CLS - PureCLIP 2,452,405 10,899 9,139,819 12,921,765
CLS - Global filter 2,329,784 10,892 8,981,133 12,685,840
CLS - Gene level filter 438,570 10,458 4,095,895 5,668,280
BS - Merged 104,034 8,826 3,196,681 4,456,865
BS - Reproducible 96,852 8,186 3,140,187 4,367,252
BS - Gene 96,852 8,127 3,140,187 4,367,252
BS - Protein 93,806 7,523 2,860,559 4,038,548
BS - Transcript 93,235 7,513 2,847,185 4,020,148
Show code
### ============================================================================
### xlinks in bs per replicate per filter step
### ----------------------------------------------------------------------------
###
df = data.frame(
  x1 = colSums(c1),
  x2 = colSums(c2),
  x25 = colSums(c25),
  x3 = colSums(c3),
  x4 = colSums(c4),
  x5 = colSums(c5),
  x6 = colSums(c6),
  x7 = colSums(c7)
)
df = format(df, big.mark = ".", decimal.mark = ",")
colnames(df) = c("CLS - PureCLIP", "CLS - Global filter", "CLS - Gene level filter", "BS - Merged", "BS - Reproducible", "BS - Gene", "BS - Protein", "BS - Transcript")
rownames(df) = c("S1 - MUT", "S2 - MUT", "S3 - WT", "S4 - WT", "S5 - WT")
df = t(df)

kable(df, caption = "Xlinks in peaks/bs per replicate per filtering step") %>% 
  kable_styling("striped") %>%
  scroll_box(width = "100%")
Xlinks in peaks/bs per replicate per filtering step
S1 - MUT S2 - MUT S3 - WT S4 - WT S5 - WT
CLS - PureCLIP 4.406.096 4.733.723 4.421.934 4.497.462 4.002.369
CLS - Global filter 4.327.379 4.653.754 4.342.396 4.416.946 3.926.498
CLS - Gene level filter 1.945.206 2.150.689 1.969.342 1.994.887 1.704.051
BS - Merged 1.530.741 1.665.940 1.551.904 1.569.061 1.335.900
BS - Reproducible 1.505.501 1.634.686 1.521.256 1.538.149 1.307.847
BS - Gene 1.505.501 1.634.686 1.521.256 1.538.149 1.307.847
BS - Protein 1.370.501 1.490.058 1.412.684 1.418.379 1.207.485
BS - Transcript 1.364.086 1.483.099 1.406.247 1.411.882 1.202.019

9 Binding patterns

9.1 Flanking regions next to exons

Show code
exn = exons(anno.db)
export(granges(exn), "./data/exn.bed", format = "BED")
# define flanking regsions 
leftEdge = flank(exn, width = 100, start = TRUE)
export(granges(leftEdge), "./data/leftEdge.bed", format = "BED")
rightEdge = flank(exn, width = 100, start = FALSE)
export(granges(rightEdge), "./data/rightEdge.bed", format = "BED")
Show code
# split BS in left /right
# splti BS in left/ right/ both

# split right side
rightOLS = findOverlaps(bsTranscript, leftEdge) %>% as.data.frame()
rightIDX = unique(rightOLS$queryHits)

# split left side
leftOLS = findOverlaps(bsTranscript, rightEdge) %>% as.data.frame()
leftIDX = unique(leftOLS$queryHits)

# make unique assignment
bothSidesIDX = rightIDX[rightIDX %in% leftIDX]
rightIDX = rightIDX[! rightIDX %in% bothSidesIDX]
leftIDX = leftIDX[! leftIDX %in% bothSidesIDX]

# split deep intronic
deepIntronIDX = 1:length(bsTranscript)
deepIntronIDX = deepIntronIDX[! deepIntronIDX %in% c(rightIDX, leftIDX, bothSidesIDX)]

# annotate ranges
rightBS = bsTranscript[rightIDX]
leftBS = bsTranscript[leftIDX]
bothSidesBS = bsTranscript[bothSidesIDX]
deepIntronBS = bsTranscript[deepIntronIDX]
mcols(rightBS)$intronLocation = "Near exon 3'"
mcols(leftBS)$intronLocation = "Near exon 5'"
mcols(bothSidesBS)$intronLocation = "Near exon both"
mcols(deepIntronBS)$intronLocation = "Deep intron"

bsTranscript = c(rightBS, leftBS, bothSidesBS, deepIntronBS)
bsTranscript = sortSeqlevels(bsTranscript)
bsTranscript = sort(bsTranscript)
mcols(bsTranscript)$intronLocation = ifelse(bsTranscript$region != "intron", NA, bsTranscript$intronLocation)

df = data.frame(intronLocation = bsTranscript$intronLocation) %>%
    table() %>%
    as.data.frame() 

ggplot(df, aes(x = intronLocation, y = Freq, fill = intronLocation)) +
    geom_col() +
    geom_text(aes(label = myFormat(Freq)), vjust = -0.3) +
    scale_fill_npg() +
    theme_pub() +
    theme(legend.position = "none") +
    labs(
        title = "Enhanced binding spectrum",
         y = "Count (#N)",
         x = "Location")

Binding sites in introns split by near exon (within 100nt from splice sites) and deep-intronic regions

Show code
exn = exons(anno.db, columns = c("exon_id", "gene_id", "exon_name"))
names(exn) = exn$exon_id
export(granges(exn), "./data/exn.bed", format = "BED")

exnBeforeIdx = follow(bsTranscript, exn) 
exnBeforeIdx[is.na(exnBeforeIdx)] = 557990
exnBeforeDist = distance(bsTranscript, exn[exnBeforeIdx])

exnAfterIdx = precede(bsTranscript, exn)
exnAfterIdx[is.na(exnAfterIdx)] = 557990
exnAfterDist = distance(bsTranscript, exn[exnAfterIdx])


distDf = data.frame(distBefore = exnBeforeDist, distAfter = exnAfterDist) %>%
    mutate(side = ifelse(distBefore < distAfter, "before", "after")) %>%
    mutate(distToExon = ifelse(distBefore < distAfter, distBefore, distAfter)) %>%
    mutate(positionTag = ifelse(distToExon > 100, "deep intron", "near exon")) %>%
    mutate(exonID = ifelse(side == "after", exnAfterIdx, exnBeforeIdx)) %>%
    mutate(bsID = names(bsTranscript)) %>%
    mutate(exonName = exn$exon_name[exonID]) %>%
    select(positionTag, side, distToExon, exonID, bsID, exonName)

mcols(bsTranscript) = cbind(mcols(bsTranscript), distDf)

### export
export(bsTranscript, "./data/bsTranscript.bed", format = "BED")
save(bsTranscript, file = "./data/bsTranscript.rda")

9.2 Distances

To find patterns in binding site spacing, we calculated the distance from each binding sites to its nearest neighbor.

Show code
dist = distanceToNearest(bsTranscript) %>% as.data.frame()
bsTranscript$dist = dist$distance
ggplot(dist, aes(x = log10(distance+1))) + 
  geom_histogram(bins = 100, color = "black") + 
  theme_nice() +
  labs(
    title = "Distance to nearest binding site",
    x = "Distance +1 (nt) [log10]",
    y = "Count") 

Distance from each binding site to the next closest neighbor.

Show code
ggplot(dist, aes(x = distance)) + 
  geom_histogram(binwidth = 1, color = "black") + 
  xlim(-1,50) +
  theme_nice() +
  labs(
    title = "Distance to nearest binding site",
    x = "Distance (nt) [0-50]",
    y = "Count") +
  geom_vline(xintercept = 7, linetype = "dashed")

Distance from each binding site to the next closest neighbor in a range of 50 nt.

10 Session Information

Show code
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] waffle_0.7.0            ggsci_3.0.0             ggpointdensity_0.1.0   
 [4] tidyr_1.3.0             tibble_3.2.1            patchwork_1.1.2        
 [7] ggtext_0.1.2            forcats_1.0.0           ComplexHeatmap_2.14.0  
[10] BindingSiteFinder_1.7.8 viridis_0.6.3           viridisLite_0.4.2      
[13] gridExtra_2.3           ggrepel_0.9.3           knitr_1.43             
[16] kableExtra_1.3.4        GenomicFeatures_1.49.7  UpSetR_1.4.0           
[19] reshape2_1.4.4          dplyr_1.1.2             AnnotationDbi_1.59.1   
[22] Biobase_2.57.1          ggplot2_3.4.2           rtracklayer_1.57.0     
[25] GenomicRanges_1.49.1    GenomeInfoDb_1.33.10    IRanges_2.31.2         
[28] S4Vectors_0.35.4        BiocGenerics_0.43.4    

loaded via a namespace (and not attached):
  [1] colorspace_2.1-0            rjson_0.2.21               
  [3] circlize_0.4.15             XVector_0.37.1             
  [5] GlobalOptions_0.1.2         gridtext_0.1.5             
  [7] clue_0.3-64                 rstudioapi_0.14            
  [9] farver_2.1.1                bit64_4.0.5                
 [11] fansi_1.0.4                 xml2_1.3.4                 
 [13] codetools_0.2-19            extrafont_0.19             
 [15] doParallel_1.0.17           cachem_1.0.8               
 [17] polyclip_1.10-4             jsonlite_1.8.5             
 [19] Cairo_1.6-0                 Rsamtools_2.13.4           
 [21] Rttf2pt1_1.3.12             cluster_2.1.4              
 [23] dbplyr_2.3.2                png_0.1-8                  
 [25] ggdist_3.3.0                ggforce_0.4.1              
 [27] compiler_4.2.1              httr_1.4.6                 
 [29] Matrix_1.5-4.1              fastmap_1.1.1              
 [31] cli_3.6.1                   tweenr_2.0.2               
 [33] htmltools_0.5.5             prettyunits_1.1.1          
 [35] tools_4.2.1                 gtable_0.3.3               
 [37] glue_1.6.2                  GenomeInfoDbData_1.2.9     
 [39] rappdirs_0.3.3              Rcpp_1.0.10                
 [41] vctrs_0.6.3                 Biostrings_2.65.6          
 [43] svglite_2.1.1               extrafontdb_1.0            
 [45] iterators_1.0.14            xfun_0.39                  
 [47] stringr_1.5.0               rvest_1.0.3                
 [49] lifecycle_1.0.3             restfulr_0.0.15            
 [51] XML_3.99-0.14               zlibbioc_1.43.0            
 [53] MASS_7.3-60                 scales_1.2.1               
 [55] hms_1.1.3                   MatrixGenerics_1.9.1       
 [57] parallel_4.2.1              SummarizedExperiment_1.27.3
 [59] RColorBrewer_1.1-3          yaml_2.3.7                 
 [61] curl_5.0.1                  memoise_2.0.1              
 [63] biomaRt_2.53.3              reshape_0.8.9              
 [65] stringi_1.7.12              RSQLite_2.3.1              
 [67] highr_0.10                  BiocIO_1.7.1               
 [69] foreach_1.5.2               filelock_1.0.2             
 [71] BiocParallel_1.31.13        shape_1.4.6                
 [73] rlang_1.1.1                 pkgconfig_2.0.3            
 [75] systemfonts_1.0.4           matrixStats_1.0.0          
 [77] bitops_1.0-7                distributional_0.3.2       
 [79] evaluate_0.21               lattice_0.21-8             
 [81] purrr_1.0.1                 labeling_0.4.2             
 [83] GenomicAlignments_1.33.1    htmlwidgets_1.6.2          
 [85] bit_4.0.5                   tidyselect_1.2.0           
 [87] GGally_2.1.2                plyr_1.8.8                 
 [89] magrittr_2.0.3              R6_2.5.1                   
 [91] magick_2.7.4                generics_0.1.3             
 [93] DelayedArray_0.23.2         DBI_1.1.3                  
 [95] pillar_1.9.0                withr_2.5.0                
 [97] KEGGREST_1.37.3             RCurl_1.98-1.12            
 [99] crayon_1.5.2                utf8_1.2.3                 
[101] BiocFileCache_2.5.2         rmarkdown_2.22             
[103] GetoptLong_1.0.5            progress_1.2.2             
[105] blob_1.2.4                  digest_0.6.31              
[107] webshot_0.5.4               munsell_0.5.0